{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "TWrdW148R9kP"
   },
   "source": [
    "# Cloud AI Platform + What-if Tool: end-to-end XGBoost example\n",
    "\n",
    "This notebook shows how to: \n",
    "* Build a binary classification model with XGBoost trained on a [mortgage dataset](https://www.ffiec.gov/hmda/hmdaflat.htm)\n",
    "* Deploy the model to [Cloud AI Platform](https://cloud.google.com/ai-platform/)\n",
    "* Use the [What-if Tool](https://pair-code.github.io/what-if-tool/) on your deployed model\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "hSFHu19Rtvbt"
   },
   "outputs": [],
   "source": [
    "#You'll need to install XGBoost on the TF instance\n",
    "!pip3 install xgboost==0.90 witwidget --user --quiet"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**After doing a pip install, restart your kernel by selecting kernel from the menu and clicking Restart Kernel before proceeding further**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "CosDxuLy7M4Q"
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import xgboost as xgb\n",
    "import numpy as np\n",
    "import collections\n",
    "import witwidget\n",
    "\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.metrics import accuracy_score, confusion_matrix\n",
    "from sklearn.utils import shuffle\n",
    "from witwidget.notebook.visualization import WitWidget, WitConfigBuilder"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "bFIxtguO1In_"
   },
   "source": [
    "## Download and pre-process data\n",
    "\n",
    "In this section we'll:\n",
    "* Download a subset of the mortgage dataset from Google Cloud Storage\n",
    "* Because XGBoost requires all columns to be numerical, we'll convert all categorical columns to dummy columns (0 or 1 values for each possible category value)\n",
    "* Note that we've already done some pre-processing on the original dataset to convert value codes to strings: for example, an agency code of `1` becomes `Office of the Comptroller of the Currency (OCC)`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "9BngZjdsO6Mr"
   },
   "outputs": [],
   "source": [
    "# Use a small subset of the data since the original dataset is too big for Colab (2.5GB)\n",
    "# Data source: https://www.ffiec.gov/hmda/hmdaflat.htm\n",
    "!gcloud storage cp gs://mortgage_dataset_files/mortgage-small.csv ."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "rou3YAIFhQCK"
   },
   "outputs": [],
   "source": [
    "# Set column dtypes for Pandas\n",
    "COLUMN_NAMES = collections.OrderedDict({\n",
    "  'as_of_year': np.int16,\n",
    "  'agency_code': 'category',\n",
    "  'loan_type': 'category',\n",
    "  'property_type': 'category',\n",
    "  'loan_purpose': 'category',\n",
    "  'occupancy': np.int8,\n",
    "  'loan_amt_thousands': np.float64,\n",
    "  'preapproval': 'category',\n",
    "  'county_code': np.float64,\n",
    "  'applicant_income_thousands': np.float64,\n",
    "  'purchaser_type': 'category',\n",
    "  'hoepa_status': 'category',\n",
    "  'lien_status': 'category',\n",
    "  'population': np.float64,\n",
    "  'ffiec_median_fam_income': np.float64,\n",
    "  'tract_to_msa_income_pct': np.float64,\n",
    "  'num_owner_occupied_units': np.float64,\n",
    "  'num_1_to_4_family_units': np.float64,\n",
    "  'approved': np.int8\n",
    "})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "18xiylX4_EPO"
   },
   "outputs": [],
   "source": [
    "# Load data into Pandas\n",
    "data = pd.read_csv(\n",
    "  'mortgage-small.csv', \n",
    "  index_col=False,\n",
    "  dtype=COLUMN_NAMES\n",
    ")\n",
    "data = data.dropna()\n",
    "data = shuffle(data, random_state=2)\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "xkt2L4Lh_buo"
   },
   "outputs": [],
   "source": [
    "# Label preprocessing\n",
    "labels = data['approved'].values\n",
    "\n",
    "# See the distribution of approved / denied classes (0: denied, 1: approved)\n",
    "print(data['approved'].value_counts())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "bER0GA7wgSHE"
   },
   "outputs": [],
   "source": [
    "data = data.drop(columns=['approved'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "dkR8t9I2_fSm"
   },
   "outputs": [],
   "source": [
    "# Convert categorical columns to dummy columns\n",
    "dummy_columns = list(data.dtypes[data.dtypes == 'category'].index)\n",
    "data = pd.get_dummies(data, columns=dummy_columns)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "eQNpBaaiZU5r"
   },
   "outputs": [],
   "source": [
    "# Preview the data\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "Mpr23PcBAnM3"
   },
   "source": [
    "## Train the XGBoost model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "qdA3eC9v0tbN"
   },
   "outputs": [],
   "source": [
    "# Split the data into train / test sets\n",
    "x,y = data,labels\n",
    "x_train,x_test,y_train,y_test = train_test_split(x,y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "GOveA7JAAmr4"
   },
   "outputs": [],
   "source": [
    "# Train the model, this will take a few minutes to run\n",
    "bst = xgb.XGBClassifier(\n",
    "    objective='reg:logistic'\n",
    ")\n",
    "\n",
    "bst.fit(x_train, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "0YNhqfWpBGLK"
   },
   "outputs": [],
   "source": [
    "# Get predictions on the test set and print the accuracy score\n",
    "y_pred = bst.predict(x_test)\n",
    "acc = accuracy_score(y_test, y_pred.round())\n",
    "print(acc, '\\n')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "_UfHZJfFhcRO"
   },
   "outputs": [],
   "source": [
    "# Print a confusion matrix\n",
    "print('Confusion matrix:')\n",
    "cm = confusion_matrix(y_test, y_pred.round())\n",
    "cm = cm / cm.astype(np.float).sum(axis=1)\n",
    "print(cm)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "VFAZoUs2vvxf"
   },
   "outputs": [],
   "source": [
    "# Save the model so we can deploy it\n",
    "bst.save_model('model.bst')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "Lop--kefvU5a"
   },
   "source": [
    "## Deploy model to AI Platform\n",
    "\n",
    "Copy your saved model file to Cloud Storage and deploy the model to AI Platform. In order for this to work, you'll need the Cloud AI Platform Models API enabled. Update the values in the next cell with the info for your GCP project. Replace GCP_PROJECT with the value in the lab page for GCP Project ID in the left pane, replace MODEL_BUCKET with gs:// with the value for BucketName appended, and replace MODEL_NAME with a name for your model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "m454ISwGiP7Q"
   },
   "outputs": [],
   "source": [
    "GCP_PROJECT = 'YOUR_GCP_PROJECT'\n",
    "MODEL_BUCKET = 'gs://your_storage_bucket'\n",
    "MODEL_NAME = 'your_model_name' # You'll create this model below\n",
    "VERSION_NAME = 'v1'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "f9VlzVTEvtiq"
   },
   "outputs": [],
   "source": [
    "# Copy your model file to Cloud Storage\n",
    "!gcloud storage cp ./model.bst $MODEL_BUCKET"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Configure gcloud to use your project\n",
    "!gcloud config set project $GCP_PROJECT"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "ew-GyuFHine9"
   },
   "outputs": [],
   "source": [
    "# Create a model\n",
    "!gcloud ai-platform models create $MODEL_NAME --regions us-central1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "h14oBXAtvUYE"
   },
   "outputs": [],
   "source": [
    "# Create a version, this will take ~2 minutes to deploy\n",
    "!gcloud ai-platform versions create $VERSION_NAME \\\n",
    "--model=$MODEL_NAME \\\n",
    "--framework='XGBOOST' \\\n",
    "--runtime-version=1.15 \\\n",
    "--origin=$MODEL_BUCKET \\\n",
    "--staging-bucket=$MODEL_BUCKET \\\n",
    "--python-version=3.7 \\\n",
    "--project=$GCP_PROJECT \\\n",
    "--region=global"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "-8xNn8EhgUi7"
   },
   "source": [
    "## Using the What-if Tool to interpret your model\n",
    "Once your model has deployed, you're ready to connect it to the What-if Tool using the `WitWidget`.\n",
    "**Note**: You can ignore the message `TypeError(unsupported operand type(s) for -: 'int' and 'list')` while creating a What-if Tool visualization."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "QzDqyqgzvV0E"
   },
   "outputs": [],
   "source": [
    "# Format a subset of the test data to send to the What-if Tool for visualization\n",
    "# Append ground truth label value to training data\n",
    "\n",
    "# This is the number of examples you want to display in the What-if Tool\n",
    "num_wit_examples = 500\n",
    "test_examples = np.hstack((x_test[:num_wit_examples].values,y_test[:num_wit_examples].reshape(-1,1)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "dqAbAmxkgW4p"
   },
   "outputs": [],
   "source": [
    "# Create a What-if Tool visualization, it may take a minute to load\n",
    "# See the cell below this for exploration ideas\n",
    "\n",
    "# This prediction adjustment function is needed as this xgboost model's\n",
    "# prediction returns just a score for the positive class of the binary\n",
    "# classification, whereas the What-If Tool expects a list of scores for each\n",
    "# class (in this case, both the negative class and the positive class).\n",
    "def adjust_prediction(pred):\n",
    "  return [1 - pred, pred]\n",
    "\n",
    "config_builder = (WitConfigBuilder(test_examples.tolist(), data.columns.tolist() + ['mortgage_status'])\n",
    "  .set_ai_platform_model(GCP_PROJECT, MODEL_NAME, VERSION_NAME, adjust_prediction=adjust_prediction)\n",
    "  .set_target_feature('mortgage_status')\n",
    "  .set_label_vocab(['denied', 'approved']))\n",
    "WitWidget(config_builder, height=800)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "_B2BskDk55rk"
   },
   "source": [
    "## What-if Tool exploration ideas\n",
    "\n",
    "* **Individual data points**: the default graph shows all data points from the test set, colored by their ground truth label (approved or denied)\n",
    "  * Try selecting data points close to the middle and tweaking some of their feature values. Then run inference again to see if the model prediction changes\n",
    "  * Select a data point and then select the \"Show nearest counterfactual datapoint\" radio button. This will highlight a data point with feature values closest to your original one, but with the opposite prediction\n",
    "  \n",
    "* **Binning data**: create separate graphs for individual features\n",
    "  * From the \"Binning - X axis\" dropdown, try selecting one of the agency codes, for example \"Department of Housing and Urban Development (HUD)\". This will create 2 separate graphs, one for loan applications from the HUD (graph labeled 1), and one for all other agencies (graph labeled 0). This shows us that loans from this agency are more likely to be denied\n",
    "  \n",
    "* **Exploring overall performance**: Click on the \"Performance & Fairness\" tab to view overall performance statistics on the model's results on the provided dataset, including confusion matrices, PR curves, and ROC curves.\n",
    "   * Experiment with the threshold slider, raising and lowering the positive classification score the model needs to return before it decides to predict \"approved\" for the loan, and see how it changes accuracy, false positives, and false negatives.\n",
    "   * On the left side \"Slice by\" menu, select \"loan_purpose_Home purchase\". You'll now see performance on the two subsets of your data: the \"0\" slice shows when the loan is not for a home purchase, and the \"1\" slice is for when the loan is for a home purchase. Check out the accuracy, false postive, and false negative rate between the two slices to look for differences in performance. If you expand the rows to look at the confusion matrices, you can see that the model predicts \"approved\" more often for home purchase loans.\n",
    "   * You can use the optimization buttons on the left side to have the tool auto-select different positive classification thresholds for each slice in order to achieve different goals. If you select the \"Demographic parity\" button, then the two thresholds will be adjusted so that the model predicts \"approved\" for a similar percentage of applicants in both slices. What does this do to the accuracy, false positives and false negatives for each slice?\n"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "collapsed_sections": [],
   "name": "What-If Tool with XGBoost Cloud AI Platform Model - end-to-end",
   "provenance": [],
   "version": "0.3.2"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
